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EMPIRICAL STATIONARY CORRELATIONS FOR 
SEMI-SUPERVISED LEARNING ON GRAPHS 1 

By Ya Xu 2 , Justin S. Dyer 3 and Art B. Owen 

Stanford University 

In semi-supervised learning on graphs, response variables ob- 
served at one node are used to estimate missing values at other nodes. 
The methods exploit correlations between nearby nodes in the graph. 
In this paper we prove that many such proposals are equivalent to 
kriging predictors based on a fixed covariance matrix driven by the 
link structure of the graph. We then propose a data-driven estimator 
of the correlation structure that exploits patterns among the observed 
response values. By incorporating even a small fraction of observed 
covariation into the predictions, we are able to obtain much improved 
prediction on two graph data sets. 

1. Introduction. Data on graphs has long been with us, but the recent 
explosion of interest in social network data available on the Internet has 
brought this sort of data to prominence. A typical problem is to predict 
the value of a feature at one or more nodes in the graph. That feature is 
assumed to have been measured on some, but not all, nodes of the graph. 
For example, we might want to predict which web pages are spam, after a 
human expert has labeled a subset of them as spam or not. Similarly, we 
might want to know on which Facebook web pages an ad would get a click, 
although that ad has only been shown on a subset of pages. 

The underlying assumption in these prediction problems is that there is 
some correlation, usually positive, between the values at vertices that are 
close to each other in the graph. By making predictions that are smooth 
with respect to a notion of distance in the graph, one is able to define a 
local average prediction. 
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This problem is often called semi-supervised learning, because some of 
the data have measured response values, while others have predictor only. 
We suppose that the response random variable at node i of the graph is Yi. 
The observed value yi is available at some, but not all, i. For a survey of 
semi-supervised learning see Zhu (2005). 

Our starting point is the graph based random walk strategy of Zhou, 
Scholkopf and Hofmann (2005). To describe their approach, let G be a 
weighted directed graph with n vertices. The edges of G are represented 
by an adjacency matrix W with entries Wij > if there is an edge from i to 
j, and Wij = otherwise. We impose via = 0, so that if the graph contains 
loops, we do not count them. Node i has out-degree Wi + = X^?=i w ij an d m_ 
degree w +i = YTj=i w ji- Tne volume of the graph is w ++ = Y™ =x YTj=\ w ij- 

There is a natural random walk associated with in which the prob- 
ability of transition from i to j is Py = Wij/u)i+. Very often this walk is 
irreducible and aperiodic. If not, it may be reasonable to modify W, by, for 
example, adding a small probability of a transition uniformly distributed on 
all nodes. For example, such a modification is incorporated into the PageR- 
ank algorithm of Page et al. (1998) to yield an irreducible and aperiodic 
walk on web pages. 

An irreducible and aperiodic walk has a unique stationary distribution, 
that we call it, which places probability 7Tj on vertex i. Zhou, Huang and 
Scholkopf (2005) define the similarity between % and j to be Sij = TTiPij + 
ftjPji- Then they construct a variation functional for vectors Z G M. n defined 
on the nodes of G: 



This variation penalizes vectors Z that differ too much over similar nodes. 
It also contains a scaling by yJWi- One intuitive reason for such a scaling is 
that a small number of nodes with a large 7Tj could reasonably have more 
extreme values of Zj, while the usually much greater number of nodes with 
small 7Tj should not ordinarily be allowed to have very large Zi, and hence 
should be regularized more strongly. Mathematically, the divisors y/Wl orig- 
inate in spectral clustering and graph partitioning algorithms. 

The prediction Z should have a small value of Q(Z). But it should also 
remain close to the observed values. To this end, they make a vector Y* 
where Y* = yi when j/j is observed and Y* = fii when yi is not observed, 
where /ij is a reasonable guess for Yi. Then the predictions are given by 



where A > is a parameter governing the trade off between fit and smooth- 
ness. 
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The minimizer Y in (2) is a linear function of Y*. In very many of the 
applications tji € {— 1, 1} is binary and Y* = is used to represents uncer- 
tainty about their value. Although linear models may seem less natural than 
logistic regressions, they are often used for discrete responses because they 
are a computationally attractive relaxation of the problem of minimizing a 
quantity over Z E { — 1, l} n . 

The outline of this paper is as follows. Because the random walk smoother 
leads to a linear method, we might expect it to have a representation as a 
minimum mean squared error linear prediction under a Gaussian process 
model for Z. That is, it might be a form of kriging. Section 2 presents nota- 
tion for kriging methods. Section 3 exhibits a sequence of kriging predictors 
that converge to the random walk semi-supervised learning prediction (2). 

The kriging model which yields random walk smoothing has a covariance 
assumption driven by the geometry of the graph, in which the y/Wl values 
play a very strong role. Section 4 explores some other graph based semi- 
supervised learning methods which have different covariance assumptions in 
their corresponding kriging models. In Section 5 we derive another kriging 
method incorporating the empirical variogram of the observed Y{ values 
into an estimate of the covariance. That method uses a full rank covariance, 
which is therefore computationally expensive for large n. We also present a 
lower rank version more suitable to large scale problems. 

Section 6 presents two numerical examples. In Section 6.1 Y{ is a numer- 
ical measure of the research quality of 107 universities in the UK and Wij 
measures the number of links from university % to j . In holdout comparisons 
our kriging method is more accurate than the random walk smoother, which 
ends up being quite similar to a linear regression on values without an 
intercept. Section 6.2 presents a binary example, the WebKB data set, where 
the response is 1 for student web pages and —1 otherwise. Incorporating em- 
pirical correlations into the semi-supervised learning methods brings a large 
improvement in the area under the ROC curve. 

Section 7 describes some simple adaptations of the approach presented 
here. Section 8 discusses some related literature in fields other than machine 
learning. Section 9 has our conclusions. 

Our main contribution is two-fold. First, to the best of our knowledge, 
we are the first to recognize the kriging framework underlying several re- 
cently developed semi-supervised methods for data on graphs. Second, we 
propose an empirical stationary correlation kriging algorithm which adapts 
the traditional variogram techniques in Euclidean space to graphs. 

2. Kriging on a graph. Kriging is named for the mining engineer Krige, 
whose paper Krige (1951) introduced the method. For background on kriging 
see Stein (1999) or Cressie (1993). Here we present the method and introduce 
the notation we need later. 



4 



Y. XU, J. S. DYER AND A. B. OWEN 



A plain model for kriging on a graph works as follows. The data Y G W 1 
are written 

(3) Y = Xp + S + e. 

Here X E M. nxk is a matrix of predictors and (3 E IR fc is a vector of coefficients. 
The structured part of the signal is S ~ M(0, E) and it is the correlations 
within E that capture how neighbors in the graph are likely to be simi- 
lar. Finally, e~AA(0,r) is measurement noise independent of S. The noise 
covariance T is diagonal. 

The design matrix X has one row for each node of the graph. It can include 
a column of ones for an intercept, columns for other predictors constructed 
from the graph adjacency matrix W and columns for other covariates mea- 
sured at the nodes. To emphasize the role of the graph structure, we only 
use covariates derived from W. At first we take X G R n , so k = l, and then 
make f3 random from A/"(/U, independent of both e and S. We write the 
result as 

(4) Y = Z + e, 

where Z ~ Af(pX, for * = S^XX' + E, independently of e ~ A/"(0,r). 

In this formulation the values Y that we have observed are noisy mea- 
surements of some underlying quantity Z that we wish we had observed. We 
seek to recover Z from measurements Y. 

Some of the Yj are observed and some are not. None of the Zi are observed. 
We let Y^ ) denote the random variables that are observed, and be the 
values we saw for them. The unobserved part of Y is denoted by Y^\ The 
kriging predictor is 

(5) z = E(z|y(°)). 

Without loss of generality, suppose that the vectors are ordered with ob- 
served random variables before unobserved ones. We partition ^ as follows: 

so that, for example, # o = cov(Z^°\Z^) and *. = cov(Z, Z^). The ma- 
trices E and V are partitioned the same way. 
The joint distribution of Z and Yw is 

Y(°)J ^U^V'Uo. ^oo + Too, 

where X( ' contains the components of X corresponding to Y^°\ Now we 
can write the kriging predictor (5) explicitly as 

Z = tf. (*oo + Too) -1 ^ - ^ (0) ) + fiX 
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(6) = (XxW/6 + E. )(X^X^'/5 + S 00 + Too)" 1 

x (y(°) - ^Sf(°)) + nX. 

In the special case where the whole vector Y = y is observed, the kriging 
predictor is 

(7) Z = (XX' /S + E)(XX'/S + £ + T)" 1 ^ - fiX) + [MX. 

We have presented the kriging method under a Gaussian framework, 
where estimators (6) and (7) are the conditional expectations and hence are 
the best predictors in terms of minimizing mean squared error (MSE). How- 
ever, even without the Gaussian assumption, estimator (6) and hence also 

(7) is the best linear unbiased predictor (BLUP) of Z, because it minimizes 
the MSE among all linear unbiased predictors [see Stein (1999), Chapter 1]. 
For background on the BLUP, see Robinson (1991). 

Letting 5 — > leads to a model with an improper prior in which Y has 
infinite prior variance in the direction given by X. One natural choice for X 
is the vector l n of all Is. Then the mean response over the whole graph is 
not penalized, just fluctuations within the graph. We will see other natural 
choices for X. 

When these matrices are unknown, one can apply kriging by estimating 
E, T and fi, and then predicting by (6). The kriging approach also gives 
expressions for the variance of the prediction errors: 

(8) var(Z | Y<® = y (0) ) = * - *. (^oo + Poo) -1 ^.- 

The predictions do not necessarily interpolate the known values. That is, 
Z(°) need not equal Y^°\ Instead some smoothing takes place. The predic- 
tions can be forced closer to the data by making Too smaller. One reason 
not to interpolate is that when the graph correlations are strong, it may be 
possible to detect erroneous labels as cases where — Y^\ is large. 

3. Random walk smoothing as kriging. Here we show that random walk 
regularization can be cast in terms of a sequence of kriging estimators. 
The random walk regularizer predicts the responses Y by 

(9) Y = argmin \ £ Sij f^L - + \\\Z - Y*\\\ 

z 2Z ~f \V^i V^jJ 

where Y* = Yi for observed values and Y* = \ii for the unobserved values. 
Zhou, Huang and Scholkopf (2005) take Hi = for unobserved yi G {—1, 1}. 

Introduce the matrix n = diag(7Ti, . . . ,vr n ), let Si + = Y^j=x s ij an d let A 
be the matrix with elements 
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The matrix A is the graph Laplacian of G, which is our original graph G 
after we replace the weights Wij by the similarities Sij. When G is undirected, 
the graph Laplacian A of G is 

_(w i+ -Wii, i = j, 

It is clear that A and A are symmetric matrices with an eigenvalue of 
corresponding to eigenvector l n . Assuming that a graph such as G or G 
is connected, its Laplacian is positive semi-definite and has rank n — 1 [von 
Luxborg (2007)]. For later use we write 

n 

(10) A = U'diag{d 1 ,d 2 , . . . ,d n _i,0)£/ = ^ j d i u i u' i , 

i=i 

where U'U = I n , with di > for i < n and d n = 0. 

In matrix terms, the right-hand side of equation (9) is 

zTr 1/2 Arr 1/2 z + x(z - y*)\z - y*) 

= ^'(n^Air 1 / 2 + xi)z - 2\z'y* + xy*'y*. 

For A > this is a positive definite quadratic form in Z and we find that 

(11) ? = A(n~ 1 / 2 An~ 1 / 2 + xi)^y* = (i + A- 1 n- 1 / 2 An- 1 / 2 )- 1 y*. 

Now we are ready to present the existing random walk algorithm as a 
special form of kriging. To get the random walk predictor (11), we do the 
following: 

(1) make strategic choices for T, £ and X , 

(2) treat the missing parts of Y as observed, 

(3) use the full data kriging estimator (7), and then 

(4) take the limit as 5 — > from above. 

In detail, the recipe is as follows: 

Theorem 1. Let Y = Z + e e W 1 . Suppose that Z = Xf3 + S, where 
let", j3~N{n,l/5) and S ~A/"(0,S). Let e~ iV(0,r) and assume that 
S , (3 and e are mutually independent. Suppose that Y^ ' comprising the first 
r > 1 elements of Y is observed. Let Y* E ]R n with Y* = Y^ for i = l,...,r 
and Y* = \iXi for i = r + 1, . . . ,n. Let Z$ be the kriging estimator (7) applied 
with y = Y*. Assume that the Laplacian matrix A derived for the similarity 
weighted graph G on which Y is defined satisfies (10). We now choose 

r = A" 1 /, 

£ = n 1 / 2 A+n 1 / 2 , and 
x = (v^> • • • i v 7 ^")'' 
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where A + is the Moore-Penrose inverse of A andU = diag(7Ti, . . . ,w n ). Then 

Hm zt = {i + A _1 n _1 / 2 An _1 / 2 ) _1 y*, 

<5->0+ 

which is the random walk predictor given by (11). 

Proof. First we notice that X = II 1 / 2 l n . The kriging estimator is = 
M S (Y* - fiX) + fiX, where 

(xx' \ (xx' ^ - 1 
M 5 = —=- + s — — + s + r 
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(12) = fx^L + n 1 / 2 A+n 1 /^ fx* + n 1 / 2 A+n 1 / 2 + a- 1 /) 

= n v 2 (hjk + a+^) n 1 / 2 (V/ 2 fhjk + a+) n 1 / 2 + \-^A ' . 

The three matrices on the left of (12) are invertible. Moving their inverses 
inside the matrix inverse there, we get 



Using (10), we write 

^+A+ = C/'diag(l , * .,-J_ "V 
noting that the last column of U is ±l n /y/n, the constant eigenvector. Now 

M a = ^J + A- 1 n- 1 / 2 ^'diag^di,cfe > ... > d^i,^W- 1 / 2 ^ . 
Letting <5->0+, 

m 5 -> m = (/ + A -1 n- 1/2 An -1/2 ) -1 . 

This limit exists because the matrix being inverted is positive definite. 
The terms related to the mean fiX vanish because 

(m x -x) = {i + A^n-v^rr 1 / 2 )- 1 !! 1 / 2 ^ - n 1/2l n 

= (/ + A" 1 n- 1 / 2 An- 1 / 2 )- 1 (A- 1 n- 1 / 2 Ai n + n 1 / 2 ^) - n l / 2 i n 
= (/ + A" 1 n^ 1 /2An- 1 /2 ) -i 

x (a-^-^ah-V 2 + /)nV 2 i„ _ u 1 / 2 ^ 

= 0. 
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The second equality follows because Al n = 0. Therefore, in view of (11), 
Z* s -+Y as <5^0+. □ 

One thing that stands out from the kriging analysis is the vector X = y/it 
interpreted component- wise. The equivalent prior on Y in the direction par- 
allel to X is improper. Thus, the method anticipates that Y could reasonably 
be a large multiple of X. When Y = (3X for some value /3 ^ 0, the similar 
nodes are the ones with comparable values of yji^l. These are not necessarily 
close together in the graph. 

The next thing that stands out is that the correlation strength between 
nodes is a fixed property of W, the graph adjacency matrix. If some response 
variables have stronger local correlations, others weaker, and still others 
negative local correlations, that is not reflected in this choice of £. 

4. Other semi-supervised learning as kriging. There are several other 
graph based semi-supervised learning methods that can be expressed in a 
similar regularization framework. In this section we build a similar con- 
nection between some of these other semi-supervised learning methods and 
kriging. We state several counterparts to Theorem 1 but omit their proofs 
because the details would be repetitive. Most of these examples are taken 
from the survey paper Zhu (2005). Some of these methods were originally 
introduced with general loss functions, but we only consider their squared 
error loss versions. This is because Y may not be linear in Y* under a general 
loss function and hence is no longer kriging. 

In each case there is a quadratic variation Q(Z) and a quadratic error 
norm on Z — Y* , each of which should ideally be small subject to a tradeoff 
between them. We take Q(Z) = Z'LZ for a smoothing matrix L and measure 
the error between Z and Y* by [Z — Y*)'A(Z — Y*). The smoothing matrix 
L is positive semidefinite and A is a diagonal matrix with An > 0, while the 
sum L + A is invertible. The algorithm then picks the minimizer of 

(13) Q(Z) = Z'LZ + {Z -Y*)'A{Z -Y*). 
It is easy to show that 

(14) Y = aigmmQ(Z) = (L + A)~ 1 AY*. 

z 

For random walk smoothing in Section 3, A is XI and L is n _1 / 2 AII _1 / 2 . 

Random walk smoothing is defined for directed graphs. A few of the 
methods discussed below are only defined for undirected graphs. To apply 
one of them to a given directed graph, the standard technique is to work 
with W + W. 

We build the connection to kriging for several semi-supervised learning 
methods below. Then, to allow easy comparison of the methods, we present 
a summary in Table 2 in Section 5.1. 
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4.1. Example one: Belkin, Matveeva and Niyogi (2004)- Belkin, Matveeva 
and Niyogi (2004) consider undirected graphs and use the (symmetric) edge 
weights Wij as similarities Sj,- . Their Tikhonov regularization algorithm uses 
a criterion proportional to 

Q(Z) = Z'AZ + \ \\ZW -y(°)|| 2 , 

where A is the graph Laplacian of G (not G). They also have an option 

to use the side constraint ^Yli=i^ = r Z)i=i^i ■ That constraint forces 
the mean prediction to equal the mean observation, and is necessary for the 
generalization bound they obtained. We do not use this condition, because 
the squared error norm on Z(°) - y(°) already forces Z<°) to be close to Y^. 

Their method fits the quadratic criterion (13) after making the substitu- 
tions L = A and A = diag(Aoir> 0/ n _ r ). The solution Y is the kriging esti- 
mator (7) with the following choices: 

T = diag(Ao 1 I r , X^In-r), 

£ = A+, and 

X = l n , 

taking the limit as 5 — > and then Ai — > 0. 

There are two key differences between this method and random walk 
smoothing. First, neither £ nor X involve y/n here. Second, this model 
uses a diffuse prior on the noise for the unobserved responses, while random 
walk smoothing uses the same variance for both observed and unobserved 
responses. Therefore, this method avoids plugging in a guess for the unob- 
served Y^\ and is thus more typical of statistical practice. 

Belkin, Matveeva and Niyogi (2004) also propose an interpolating algo- 
rithm that leaves all the known values unchanged in the prediction. That 
is, Y^ = Y^ for i = 1, . . . , r. The resulting prediction arises in the limit as 
Ao — > 00 for the Tikhonov estimator and, hence, the connection to kriging 
remains the same. 

They consider the generalization that replaces A by A p for a positive 
integer power p. They also consider a generalization in which there could be 
more than one measurement made on the response variable at some of the 
nodes. We do not consider cases more general than or 1 observed response 
values per node. 

4.2. Example two: Zhou et al. (2004 )■ Zhou et al. (2004) present an undi- 
rected graph algorithm that is a predecessor to the random walk smoothing 
of Zhou, Huang and Scholkopf (2005). For an undirected graph = Wji, 
and of course the in- and out-degrees of each node coincide. Let D be the 
diagonal matrix containing the common degree values Da = Wi+ = w+i . 
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They minimize 



h3 



2 

+ X\\Z-Y* 



which is the random walk smoothing criterion (9) after replacing the sim- 
ilarity Sij by the weight Wij and the stationary probability ui by the de- 
gree Da. Recall that for an irreducible aperiodic random walk on an undi- 
rected graph with transitions Pjj = Wij/wi+, the stationary distribution 
has 7Tj = Da/w++. Also, the similarity values become proportional to Wij\ 
Sy = (Du/w ++ )(wij j Do) + (Djj/w ++ )(wji/Djj) = 2wij/w ++ . As a result, 
the symmetrized version of (9) is equivalent to (15) after multiplying A by 
1/2. 

The criterion (15) fits the standard form (13) with L = D 1 / 2 AD l l 2 and 
A = XI, where A is the graph Laplacian of G. 

Their estimate reduces to the kriging estimator (7) with the following 
choices: 

r = A^ 1 /, 

£ = D 1/2 A+£> 1/2 , and 



A — (a/ Dn, . . ., \J D nn ) , 

in the limit as 5 — > 0. 



4.3. Example three: Zhou, Scholkopf and Hofmann (2005). Zhou, 
Scholkopf and Hofmann (2005) propose another random walk based strategy 
on directed graphs that is motivated by the hub and authority web model 
introduced by Kleinberg (1999). For Zhou, Scholkopf and Hofmann (2005), 
any node with an outlink is a hub and any node with an inlink is an au- 
thority. A node can be both a hub and an authority. They use two random 
walks. Their hub walk transitions between hubs that link to a common au- 
thority and their authority walk transitions between authorities linked by a 
common hub. 

The hubs define a walk on the authorities as follows. From authority i 
we pick a linking hub h with probability Whi/w+i and from there pick an 
authority j with probability Whj/wh + . The resulting transition probability 
from i to j is 

p(A) _ ST^ ^hi_ w hj 

a - Z-s w+ . ' Wh+ ' 

where the sum is over hubs h. Analogous hub transition probabilities are 

V {H) _ \- W ia Wja 
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summing over authorities a. 

The stationary distributions of these two walks have closed forms 

tc\ = Wi + /w ++ and ir\ ' = w + i/w ++ . 

These formulas give appropriate zeros for nodes i that are not hubs or au- 
thorities respectively. 

We use stationary distributions and Laplacians of these two walks. Let 

Hh = diag(7r| H ' ) , . . . ,7r^) and LT4 = diag^j;^, . . . ,7Tn )■ Then let Ah be 
the Laplacian of the graph Gh , which is our original graph G after replacing 
the weights Wij by the similarity = n^Pjj + 7rj P- t - . Similarly, let 

A a be the Laplacian of Ga which has weights = 717 + 7n P^f\ 

The hub and authority regularization of Zhou, Scholkopf and Hofmann 
(2005) uses the quadratic criterion (13) with A = XI and smoothing matrix 

(16) L = 7 u- l/2 A H u H l/2 + (1 - 7 )n" 1/2 A j4 n^ 1/2 

for some 7 G [0,1]. The choice of 7 allows the user to weigh the relative 
importance of inlinks and outlinks. 

Their hub and authority walk smoother matches the kriging estimator (7) 
with the following choices: 

r = A" 1 /, 

S = ( 7 U H 1/2 A H U H 1/2 + (1 - 7)n- 1/2 A A n- 1/2 )-\ and 

x = o n . 

Ordinarily, L is positive definite for < 7 < 1. The two terms in (16) each 
have one eigenvector with eigenvalue 0, but those two eigenvectors are, in 
general, linearly independent. We can construct exceptions. For example, if 
G is the complete graph, then the hub and authority walks coincide and 
L reduces to the random walk case which has one zero eigenvalue. More 
generally, if every node has Wi + = w + i, the same thing happens. Outside of 
such pathological examples, L is positive definite. 

4.4. Example four: Belkin, Niyogi and Sindhwani (2006). The manifold 
regularization framework introduced by Belkin, Niyogi and Sindhwani (2006) 
considers undirected graphs with similarity Sij = . They predict the re- 
sponses Y by 

Y = argmin \\Z\\ 2 K + -fZ'AZ + A ||Z (0) - Y (0) || 2 , 
z 

where K, is a Mercer kernel [see Cristianini and Shawe- Taylor (2000), Chap- 
ter 3], A is the graph Laplacian and 7 > 0. The term H-Z^^- controls the 
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smoothness of the predictions in the ambient space, while Z'AZ controls 
the smoothness with respect to the graph. We consider the special case 
where K. is a linear kernel. Then = Z'KZ for a positive semidefinite 

matrix K E M nxn . Now manifold regularization uses the criterion (13) with 
L = K + 7A and A = diag(Aoir> 0/ n _ r ). 

We have two cases to consider. The matrix 7A has n — 1 positive eigen- 
values and an eigenvalue of for the eigenvector l n . If Kl n = 0, then 
L = K + 7A is singular, but otherwise L is positive definite. 

When L is positive definite, the implied prior is not improper in any direc- 
tion so we take X = n . In this case, the manifold regularization predictions 
are from the kriging estimator (7) with the following choices: 

A^/ r \ 

Xi In—r J 
£ = (K + 7A)-\ and 

X = n , 

in the limit Ai — > 0. 

Now suppose that K + jA fails to be invertible because K has eigenvector 
l n with eigenvalue 0. In this case, we replace (-KT + 7A) -1 by the correspond- 
ing Moore-Penrose inverse and use X = l n , taking the limit 8 — > 0. 

Our condition that the Mercer kernel be linear is necessary. For a general 
Mercer Kernel /C, the prediction Y need not be linear in Y*, and so for such 
kernels, manifold regularization does not reduce to kriging. 

4.5. Other examples: smoothing matrix derived from A. A few papers 
[e.g., Kondor and Lafferty (2002), Smola and Kondor (2003), Zhu, Ghahra- 
mani and Lafferty (2003)] construct the smoothing matrix L based on a 
spectral transformation of the graph Laplacian A. They take 

n 

L = ^2f(d i )u i u' i , 
i=i 

where d{ and Uj are eigenvalues and eigenvectors of A as in (10), and /(•) 
is a nonnegative increasing function, such as f(x) = e a x l 2 . 
When f(d n ) > 0, the connection to kriging can be written as 

V A^/n-J' 

n 

T, = '^2f(d i )~ 1 u i u' i , and 

i=l 
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For f(d n ) = 0, r remains the same but now 



n-l 

s = ^2 f( d iy lu i u i and 
i=i 

X = 1„. 



with 5—^0. 



5. Empirical stationary correlations. In the last two sections we have es- 
tablished connections between kriging and several semi-supervised learning 
models for prediction on graphs. Such relationships are themselves inter- 
esting, but what is more striking to us is that the connections to kriging 
reveal a unanimous assumption by all these models: the signal covariance is 
a given function of the graph adjacency matrix W. This is clearly not an 
effective way to capture the various correlation properties that different re- 
sponse variables may present. For instance, even on the same social network 
(i.e., the same W), friends may correlate differently for age, than for gender, 
school attended or opinions about music, movies or restaurants. 

This troubling feature motivates us to propose a different model for the 
signal covariance that can adapt to the nature of the response variable. 
Similar to the prediction methods discussed thus far, we keep the kriging 
framework as presented in Section 2, but now we show how to adapt the 
covariance to the dependency pattern seen among the non-missing Y values. 

5.1. Stationary correlations. We start with the model for the underlying 
signal Z, 

(17) Z ~N(llX,o- 2 VRV), 

where the covariance is decomposed into a correlation matrix R £ M nxn , 
a diagonal matrix V = diag(ui, . . . ,v n ) containing known relative standard 
deviations Vi > 0, and a scale parameter a > 0. 

Model (17) includes both random walk regularization (Section 3) and the 
Tikhonov regularization (Section 4.1) as special cases. The connections are 
made in Table 1. 

The key element in (17) is the correlation matrix R. All of the meth- 
ods summarized in Table 2J;ake R to be a fixed matrix given by the graph 
adjacency matrix, via A, A, II and related quantities. Our model for Rij 
is p(sij), where p is a smooth function to be estimated using the response 
values, and Sij is a measure of graph similarity. These correlations are sta- 
tionary in s, by which we mean that two node pairs ij and i'f from different 
parts of the graph are thought to have the same correlation, if = Si'j> . 
The standard deviations avi, by contrast, are proportional to given num- 
bers that need not be stationary with respect to any feature of the nodes. 
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The signal means need not be stationary either. The estimation procedure, 
including measures to make R positive semidefinite, is described in detail in 
Section 5.2 below. 

Like these regularization methods, we assume in model (17) that X, v and 
Sij are prespecified based on domain knowledge. We take the noise variance 
to be A -1 /, like the random walk does, but unlike the Tikhonov method, 
which uses effectively infinite variance for the unmeasured responses. 

In the examples in Section 6, when we compare to the random walk 
method, we will use = TtiPij + T^jPji- Similarly, when we compare to 
the Tikhonov regularized method, we will use Sy = Wij, or symmetrized 
to Wij + Wji if the graph is directed. In this way we will use the exact same 
similarity measures as those methods do. There is one additional subtlety. 
We found it more natural to make the correlation a smooth function of 
similarity. For the other methods it is the inverse of the correlation that is 
smooth in Sj,-. 

In matrix form, our prediction is 

(18) Z = *. (^oo + A- 1 !)" 1 ^ " M*o) + fiX, 

where ^> = a 2 VRV, and we use the estimate R described next. 

5.2. Covariance estimation through the variogram. Here we adapt the 
variogram-based approach from geostatistics [see, for example, Cressie (1993)] 
to estimate the matrix R. With noise variance A -1 /, the variogram of the 
model (17) is 

$y = §E((Yj - fiXi) - {Yj - nXj)) 2 

(19) 

= A" 1 + \a 2 (v 2 + v 2 - 2viV j R ij ). 

For 1 < i,j < r both Yi = yi and Yj = yj are observed and so we have the 
naive estimator 

(20) % = \(( yi - fix-) - {y 3 - fiXj)) 2 . 

Table 1 

Parameters chosen for model (17) to obtain the 

random walk smoothing and the Tikhonov 
smoothing methods. Both models use the limit 
5^0 





X 


V 


a 2 Rij 


Random walk: 








Tikhonov: 


In 


In 
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Table 2 

Summary of connections between some semi-supervised learning methods and kriging 
Reference T S X Limits 

Zhou, Huang and Scholkopf (2005) A _ 1/ n i/ 2 ^+ n i/2 n i/2 1 S ^ Q 
(Random walk) 

Belkin, Matveeva and Niyogi (2004) / A" 1 /,. \ .+ 5^0 

(Tikhonov) y o X^In-r) n \l^0 

Belkin, Matveeva and Niyogi (2004) \ 5^0 

(Interpolated) ( „ ) A+ 1„ Ai 

Ao — > oo 

Zhou et al. (2004) A" 1 / D 1/2 A+D 1/2 D 1/2 l n S^0 



Zhou, Scholkopf and Hofmann (2005) \~ 1 r ((1 — i)^-a 1/2 ^a^ a 



-1/2 

(Hub & authority) +^n"- 1/2 A ff rr 1/2 



(Manifold, Kl n ^O n ) 



i =1 f(di) 1 u l u' 1 n 



Spectral transform 
/(d„)>0 

Spectral transform i a i r u vhn-i r,, \-i i 

f(d n ) = [ o Ar^-rT 1=1 













Ar 


1 /, 












a: 


1 /, 


A" 


x i 




A" 






A.7 1 /. 










a: 


1 /, 


Ao" 1 /. 










a: 


1 I, 


A.7 1 /. 










Ar 


!/, 


A.7 1 /. 










a; 


1 /, 



(A- + 7A)" 1 n Aj^O 



Belkin, Niyogi and Sindhwani (2006) fx^ 1 ^ \ A n+ 5 — > 

(Manifold, K1„ = 0„) I o \7 l I n -r) 7 A i ^ 



The naive variogram is our starting point. We translate it into a naive 
value Rij by solving equation (19). This requires the prespecified values of 
Vi and Vj. We also need values for A and a, which we consider fixed for now 
and will discuss how to choose them later. 

Once we have the naive correlation estimates Rij , we use a spline smoother 
to fit the smooth function Rij = p(sij). Smoothing serves two purposes. It 
yields correlation as a function of similarity Sij, and it reduces sampling 
fluctuations. Next we use p to estimate the entire correlation matrix via 
Rij = p(sij), for i 7^ j with of course Ru = 1. To complete our estimation of 
the signal variance, we take \I/ = a 2 VRV, and then if necessary modify ^ to 
be positive semi-definite. Two versions of the last step are considered. One 
is to use \E , + , the positive semi-definite matrix that is closest to ^> in the 
Frobenius norm. The other method is to use a low rank version of to 
save computational cost. 

The step-by-step procedure to estimate the signal covariance is listed in 
Table 3. The output is the estimated ^, which we use through equation (18) 
to make predictions. 

We choose a and A by cross-validation. This is the same technique used 
by the semi-supervised methods discussed in Sections 3 and 4. It is also 
similar to treatment of the shrinkage parameter used in ridge regression. 
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Table 3 

The steps we use to estimate the covariance matrix = a 2 VRV in model (17) via an 
empirical stationary correlation model 

Variance estimation with stationary correlations 

Given A > and a > 0: 

1. For every pair of observed nodes i,j = l,...,r and i ^ j, estimate Rij by solving (19) 
with $ij estimated using (20): 

(21) ^ = g V + tg)/2 + A-i-3«, 

a 2 ViVj 

2. Smooth the pairs {(-Rij , Sij) :i, j = 1, . . . , r} to obtain the estimated correlation function 

3. Compute Rij = p(sij) for j and Ra = 1. 

4. Set $ = a 2 VRV. 

5. Use one of the following two methods to make $ positive semi-definite. Let = U'HU 
be the eigen-decomposition of Then 

(a) use ^+ = U'H+U, where H+ = max(F, 0), or, 

(b) use V^v' = U'H+ U, where i/y 1 ' consists of the first k diagonal elements of H+ 
and the rest are set to be zero. 

Choice (a) gives the positive semi-definite matrix that is closest to in the Frobenius 
norm. Choice (b) is used when computational cost is a concern or the true covariance 
is believed to be low-rank. 



In our cross-validation, some known labels are temporarily treated as 
unknown and then predicted after fitting to the other labels. The entire 
graph structure is retained, as that mimics the original prediction problem. 
When estimating error rates we use training, test and validation subsets. 

5.3. Practical issues. As we have discussed, we need to make choices for 
X, v and Sij that go into our model (17). These prespecified values should 
come from domain knowledge about the response variable of interest. They 
may depend on the graph adjacency matrix W, but not on the realization 
of the variable Y. The single predictor X corresponds to the direction that 
Y varies along and v the amount of univariate variations. The similarity 
defines the closeness of nodes i and j. There are clearly many possible 
choices for these parameter values, and we do not yet have any guidance on 
how best to select them for a specific problem. 

The connections to the random walk and the Tikhonov smoothing meth- 
ods present two sets of example choices, as listed in Table 1. While v and 
X can be set separately, both methods take v = X, and so signals Zi with 
a larger absolute mean \fJ>Xi\ also have a larger variance (J 2 Xf. This seems 
reasonable but of course some data will be better fit by other relationships 
between mean and variance. One appealing feature of choosing v = X is that 



EMPIRICAL STATIONARY CORRELATIONS ON GRAPHS 



17 



(17) has a simple interpretation that the scaled signals Zi/Xi are Gaussian 
with constant mean, constant variance, and stationary correlation R. We 
will compare the two sets of choices using real examples in Section 6 below. 

It is worthwhile to point out that, for unweighted graphs, the Tikhonov 
smoothing method leads to very few distinct values of Sy. In this case, 
we simply use the average of Rij for each distinct Sj,- to approximate p 
without smoothing. For choices that lead to many distinct values of sy, 
cubic splines with ten knots are used to get p out of convenience. Better 
choices of smoothing method could probably be made, but we expect the 
differences among adaptive correlation methods to be smaller than those 
between methods with fixed correlations and methods with simple adaptive 
correlations. 

Finally, all measurement errors have the same variance A -1 in our model. 
It is unnecessary to assume a different noise variance for the unobserved Y 
because their variance does not affect our predictor (18). 

5.4. Relation to geo statistics. We use a nonparametric estimate /?(•) to 
avoid forcing a parametric shape on the correlation function. The parametric 
curves used in the geostatistics literature for R d with small d may not extend 
naturally to graphs, even if they could be properly embedded in Euclidean 
space. 

Both Hall, Fisher and Hoffmann (1994) and Shapiro and Botha (1991) 
have discussed ways to fit a nonparametric variogram while ensuring a pos- 
itive semi-definite covariance. Their techniques apply when the predictor 
space is M d . The usual definition of the similarity measure on a graph is far 
from being a metric in R d . Our approach ensures that the estimate for ^ is 
positive semi-definite. 

When there are n observations, Hall, Fisher and Hoffmann (1994) find 
convergence rates for the smoother p that are comparable to that using n 2 
observations. The reason is that we get 0(n 2 ) pairs (Yi,Yj) in the empirical 
variogram. In our application there are only r(r — l)/2 observed pairs to 
use. 

In the spatial smoothing problems where kriging originates, it is often 
necessary for the covariance to remain semi-definite at any finite list of 
points in M. d , including some that are not yet observed. Our setting does 
not require us to construct an extension of the covariance function to Y{ 
for nodes i that are not in the graph. Even in cross-validation, we know 
the positions in the graph for the points at which no Y values have been 
observed, and so we can still compute Sij for all data pairs. This aspect 
of the semi-supervised setting makes the problem much simpler than that 
faced in geostatistics. It does, however, mean that when the graph changes, 
the covariance model may have to be recomputed. 
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6. Examples. In this section we compare our empirical covariance ap- 
proach to the random walk and the Tikhonov regularization methods. We 
use two extremely different real data sets. The first one has a continuous 
response on a dense, weighted graph, and the second one has a binary re- 
sponse on a sparse, unweighted graph. Because both graphs are directed, 
we construct an undirected graph for the Tikhonov approach using W + W 
as the adjacency matrix. Our empirical-based method, together with its low 
rank variations, brings substantial improvements for both methods on both 
data sets. 

Recall that we need to prespecify the values of X, v and Sj,- for our 
empirical covariance approach. Following the discussion in Section 5.3, we 
consider two versions of our model. One follows the choices of the random 
walk method and the other follows the Tikhonov method, which we call 
"empirical random walk" and "empirical Tikhonov" respectively, compared 
to the "original random walk" and the "original Tikhonov." Therefore, the 
comparisons using real data serve two purposes. First, the comparison be- 
tween the empirical and the original shows how performance changes when 
we incorporate empirical stationary correlations. Second, the comparison 
between the two original (or empirical) methods shows how the choices of 
the prespecified parameters affect the predictions. 

We also need to estimate the overall mean \x. For binary problems with 
Yi £ { — 1,1} , we take fx = 0, as is done in the machine learning literature. 
For continuous responses we use 

i=l 

We also investigated estimating \i by generalized least squares regression 
of on J", taking account of estimated correlations among the first r 
response values. This made only a very small difference even on the small 
problems we are about to report, and so we see no reason to prefer it to 
the very simple estimate (22). We do want to point out that estimating fi is 
necessary for the empirical methods and the original random walk method, 
but not necessary for the original Tikhonov method. This is because even 
though fi is used in the construction of Y*, it disappears from Y* in the 
Ai — > limit for the original Tikhonov method. 

6.1. The UK university web link data set. 

Data description. The university data set contains the number of web 
links between UK universities in 2002. Each university is associated with 
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a research score (RAE), which measures the quality of the university's re- 
search. 4 After removing four universities that have missing RAE scores, or 
that have no in-link or out-link, there are 107 universities. 

The response variable, RAE score, is continuous and ranges from 0.4 to 
6.5 with a mean of 3.0 and a variance of 3.5. The number of links from one 
university to another forms the (asymmetric) weighted adjacency matrix W. 
The distribution of the weights Wij is heavily right tailed and approximately 
follows a power law. About 15% of the weights are zero, and 50% of them 
are less than 7, while the maximum is 2130. 

Illustration of the empirical covariance method. We first use the entire 
data set to illustrate the empirical variance estimation procedure as given 
in Table 3. We illustrate only the empirical Tikhonov method and hence use 
v = X = l n and Sjj = Wij + Wj{. These similarity scores take many values, 
and so we use correlation smoothing. The empirical random walk method is 
similar. In practice, a 1 and A are chosen by cross-validation, but we fix a 2 = 5 
and A -1 = 0.01 here to show one iteration of the estimation procedure. 

Figure 1 (left) plots the naive estimates Rij, as computed in (21), against 
(log transformed) similarity Sij values. The logarithm is used because the Sjj 
are skewed. The scatter plot is very noisy, but we can nonetheless extract a 
nontrivial p(-) with cubic spline smoothing (ten knots), as shown by the red 
curve. The same curve is also included on the right plot at a larger scale. 

It is striking that p(-) is not monotonically increasing in s^. The greatest 
correlations arise for very similar nodes, but the very least similar node 
pairs also have somewhat more correlation than do pairs with intermediate 
similarity. Recall that a similarity of means that the pair of universities 
are not linked. Universities without many links are overrepresented in such 
pairs, and those universities tend to have similar (low) RAE scores. 

The final step in Table 3 is to make the covariance matrix that directly 
results from p(-) positive semi-definite. For the full rank version we plot 
points ^ / +/(7 2 on the right side of Figure 1. These scatter around the red 
curve which shows p. We saw similar patterns (not shown here) with some 

low rank estimates . During this final step, we saw in Figure 1 (right) 
that a small number of highly similar node pairs got the greatest change in 
model correlation. That pattern did not always arise in other examples we 
looked at. 

Performance comparisons. Now we turn to performance comparisons. 
For this, we hold out the RAE scores of some universities and measure each 



4 The data are at http://cybermetrics.wlv.ac.uk/database/stats/data/. We use 
the link counts at the directory level. 
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UK University Data set 



CM 




log(sy + 1) log(s*j + 1) 



Fig. 1. Illustration of the empirical Tikhonov method with the UK university data. Left: 
scatter plot of the naive Rij values versus log(s;j + 1) with the cubic spline smoothing 
curve (red). Right: final estimates ty+ij/a 2 versus log(sij + 1) with the same smoothing 
curve (red). 

prediction method by mean squared error (MSE) on the held out scores. 
The size of the holdout set ranges from approximately 10% to 90% of the 
entire data set, and 50 trials are done at each holdout level. 

Our empirical methods have two tuning parameters A and a, while the 
original random walk and Tikhonov methods have only one. Nevertheless, 
the comparison is fair because it is based on holdout sets. For each set of 
held-out data we used ten-fold cross-validation within the held-in data to 
pick A and a for empirical stationary correlation kriging. For the original 
random walk and Tikhonov methods we use the best tuning parameter (A), 
and so our comparisons are to somewhat better versions of the random walk 
and Tikhonov method than one could actually get in practice. 

We define a baseline method that considers no signal covariance, and 
simply regresses the responses Y{ on X^. With the random walk choice of 
Xi = y/TTi, the baseline prediction is /iy^, while with the Tikhonov choice 
of Xi = 1, it is simply jX. 

The results are shown in Figure 2. The random walk method performs 
quite well compared to the Tikhonov method, but neither of them out- 
perform their corresponding baseline methods by much, even with the best 
tuning parameters. The black and red curves track each other closely over 
a wide range of data holdout sizes, with the red (graph-based) curve just 
slightly lower than the black (baseline) curve. 

The results show that the random walk choices v = X = ^/tt and Sij = 
TTiPij + TTjPji are clearly better than the Tikhonov choices v = X = l n and 
s^ = Wij + Wji for the UK university data. Another difference between the 
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Fig. 2. MSEs for the RAE scores at different holdout sizes. Left: the original random walk 
(red) compared with our empirical random walk (green). Right: the original Tikhonov (red) 
compared with our empirical Tikhonov (green). Baseline methods (black) are described in 
the text. 



methods is that the Tikhonov method symmetrizes the graph. As such, it 
does not distinguish between links from University i to j and links in the 
other direction. Even the baseline for the random walk method, which does 
regression on ^/tt, makes use of the directionality because that directionality 
is reflected within it. 

The green curves in Figure 2 show the error rates for the two versions 
of the empirical stationary correlation method. They generally bring large 
performance improvements, except at the very highest holdout levels for the 
Tikhonov case. Then as few as 17 University scores are being used and while 
this is probably too few to estimate a good covariance, it does not do much 
harm either. All the methods do better when less data are held out. The 
methods with data driven correlations have slightly steeper performance 
curves. 

We make a numerical summary of the curves from Figure 2 in Table 
4. We compare performance for the setting where about half of the data 
are held out. For both cases, kriging with empirical stationary correlations 
typically brings quite large improvements over the original methods. Low 
rank variations of empirical stationary correlation kriging perform similarly 
to the full rank empirical method, except for the rank 1 case in the random 
walk setting. There we still see a large improvement but not as much as 
for the full rank or rank 5 cases. The good performance of the low rank 
versions could reflect a small number of latent effects, or the benefits of 
regularization. 
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Table 4 

The relative improvement over baseline when 50 of 
107 ARE scores are held out. The baseline methods 

are simple regressions through the origin on 
X = t/ti (random walk) and on X — l n (Tikhonov) 





Improvement 


over baseline 




Random walk 


Tikhonov 


Baseline MSE 


1.71 


3.64 


Original random walk 


3.8% 




Original Tikhonov 




3.2% 


Empirical 


25.0% 


50.9% 


Empirical R5 


32.4% 


53.9% 


Empirical Rl 


19.1% 


50.9% 



6.2. The WebKB data set. The WebKB data set 5 contains web pages 
collected from computer science departments of various universities in Jan- 
uary 1997. The pages were manually classified into seven categories: student, 
faculty, staff, department, course, project and other. The data set we have 
is a subset, where the web pages belonging to the "other" class are removed. 
We will only use the data for Cornell University, which has 195 web pages 
and 301 links, after removing the three self loops. We further reduce the 
web page labels to be "student" (1) and "nonstudent" (—1). There are 83 
student pages in total. The adjacency matrix is unweighted, that is, Wij is 
1 if there is a link from page i to j and otherwise. Again, the links are 
directed and, hence, W is asymmetric, with 99.2% of the Wij being zero. 



5 The data are at http : //www. cs .umd.edu/projects/linqs/projects/lbc/ index .html . 

Table 5 

The relative improvement over baseline when 100 out 
of 195 web page labels are held out. The baseline 
AUC is 0.5 





Improvement 


over baseline 




Random walk 


Tikhonov 


Baseline (1 - AUC) 


0.5 


0.5 


Original random walk 


-5.4% 




Original Tikhonov 




8.5% 


Empirical 


43.0% 


37.5% 


Empirical R5 


40.0% 


31.9% 


Empirical Rl 


29.0% 


16.3% 
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Fig. 3. Classification error for web page labels at different holdout sizes, measured with 1 
minus the area under the ROC curve. Left: the original random walk (red) compared with 
our empirical random walk (green). Right: the original Tikhonov (red) compared with our 
empirical Tikhonov (green). The baseline method is random guessing. 



The kriging models make continuous predictions of the binary response. 
We use the area under the ROC curve (AUC) to measure performance on the 
holdout sets. The AUC is equivalent to the probability that a positive label 
will get a higher prediction than a negative label. To estimate the correlation 
function in the empirical based method, we again use cubic splines with ten 
knots for the random walk Sij. However, for the Tikhonov Sjj, which has only 
three possible values 0, 1 and 2 in an unweighted directed graph, we simply 
use the average at each without smoothing. The tuning parameters are 
picked in the same way as for the university data set. 

The results are plotted in Figure 3 and summarized in Table 5. As a 
baseline, we consider a model which sorts the web pages in random order. It 
would have an AUC of 0.5. For the WebKB data, the Tikhonov method has 
better accuracy than the random walk method which actually has trouble 
getting an AUC below 0.5. It is interesting that in this case the method which 
ignores link directionality does better. In both cases empirical stationary 
correlations bring large improvements. As before, we see that larger amounts 
of missing data make for harder prediction problems. 

7. Variations. In many applications we may want to use more nuanced 
error variance measures, such as r = diag(crf , . . . , er„), and this fits easily 
into the kriging framework. For example, web pages determined to be spam 
after a careful examination could be given a smaller of than those given less 
scrutiny, and those not investigated at all can be given a still higher af. 
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Sometimes we can make use of an asymmetry in the labels. For example, 
positive determinations, for example, Is, may have intrinsically higher con- 
fidence than negative determinations, —Is, and we can vary er, to account 
for this. Similarly, when one binary label in ±1 is relatively rare, we could 
use a value other than as our default guess. 

Finally, it is not necessary to have v = X, where v appears in the variance 
model through a 2 VRV with V = diag(u) and X in the model for the mean 
through fiX. We use v = X in the examples in Section 6 because this is 
the choice of the random walk and the Tikhonov methods. Also, we could 
hybridize the Tikhonov and random walk models, using v = X = l n from the 
former inside the regression model with the edge directionality respecting 
covariance of the latter. 

8. Other related literature. We have so far focused on the graph-based 
prediction methods from the machine learning literature. We would like to 
point out a few related works in some other fields as well. 

In the social network literature, researchers have built network autocor- 
relation models to examine social influence process. For more details see, 
for example, Leenders (2002) and Marsden and Friedkin (1993). A typical 
model is as follows: 

(23) Y = X/3 + lj, uj = aBu + e, 

where a is the network autocorrelation parameter, B is a weight matrix and 
e ~ A/"(0, a 2 1). This model is mainly used for estimating or testing a and /3, 
but we could of course use it for prediction purpose as well. Notice that we 
can write model (23) as 

Y ~N{X(3,a 2 AA'), 

where A = (I — aB)~ l . Comparing to the other models we have discussed 
so far, Y here is no longer a noisy measurement of some underlying quantity 
Z. The covariance a 2 AN depends on a scaled weight matrix aB. Leenders 
(2002) discusses a few ways to construct the weight matrix B, but all of 
them involve only the graph adjacency matrix and some a priori quantities. 
Nevertheless, the autocorrelation scale a, which is estimated from data, can 
incorporate some empirical dependence from the observed Y. 

Heaton and Silverman (2008) consider prediction at unobserved sites in 
IR 1 or M 2 . The underlying function Z is assumed to have a sparse wavelet 
expansion, which they utilize within an MCMC framework to generate a 
posterior distribution for the unobserved Y. Their method is shown to have 
better performance in neighborhoods containing discontinuities where other 
methods, for example, kriging, would smooth. While this method applies to 
data in Euclidean space with the regular wavelet transform, Jansen, Nason 
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and Silverman (2009) discuss a potential extension to data arising on graphs 
using the wavelet-like transform they introduce. 

Finally, Hoff, Raftery and Handcock (2002) model the relational tie be- 
tween a pair of nodes in a social network by introducing a latent position 
for each node in a low dimensional Euclidean space. Handcock, Raftery and 
Tantrum (2007) then propose a(n) (unsupervised) clustering method by as- 
suming these latent positions arise from a mixture of distributions, each 
corresponding to a cluster. Of course, we can also see the potential to utilize 
these latent positions in Euclidean space kriging methods to make predic- 
tions. 

9. Conclusion. We have shown that several recently developed semi- 
supervised learning methods for data on graphs can be expressed in terms 
of kriging. Those kriging models use implied correlations that derive from 
the graph structure but do not take account of sample correlations among 
the observed values. 

Our proposed empirical stationary correlation model uses correlation pat- 
terns seen among the observed values to estimate a covariance matrix over 
the entire graph. In two numerical examples we saw that using empirical cor- 
relations brought large improvements in performance. Even when there were 
large differences between the performance levels of different semi-supervised 
methods, the use of empirical correlations narrowed the gap. This reduces 
the penalty for the user who makes a suboptimal choice for X, v and Sij. 

The stationary correlation model was motivated by the idea that the 
correlations should be some unknown monotone function of similarity, and 
that given enough data, we could approximate that function. We were mildly 
surprised to see a nonmonotone relationship emerge in our first example, 
though it was interpretable with hindsight. We do not have a way to test 
models of this kind, beyond using cross-validation to choose between two of 
them. 

We have not implemented our method on any large scale problems. Large 
scale presents two challenges. First, solving equations with an n x n matrix 
is expensive. Second, the number of correlation pairs Rij to smooth is large. 
Reduced rank correlation matrices will mitigate the first problem. We might 
further benefit from the sparsity of Sj,- by writing the covariance ^ as sum 
of a sparse matrix and a rank one matrix. The second problem only arises 
when the number r of labeled cases is large. Large r is much rarer than large 
n, and in any case can be mitigated by downsampling the correlation pairs 
before smoothing. In our examples covariance estimates derived from quite 
small numbers of observation pairs still performed well. We finish by pointing 
out that there are a good many smaller data sets to which semi-supervised 
learning on graphs may be applied. 
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